Ramayan, Subtitles: An Appeal to the Wikipedians of the World
- Doordarshan (Government of India)
Greetings and salutations to all of you who help build the Internet Archive and Wikipedia! This public park belongs to all the world, and access to knowledge is a human right. Thank you for your service.
This page explains how we created subtitles for the Ramayan in 109 languages and what you can do to help us make the subtitles better.
The Ramayan had subtitles in the English language embedded in the video files. We used ffmpeg to extract the .srt files. Using open source code we developed, we handed each of those files to Google’s translation API, which supports 109 languages. The translation is pretty good out of the box, but it certainly isn’t perfect. Sometimes the cloud service, which is based on a neural network and machine learning, gets the translation wrong. Sometimes the base closed-captions file we started with in English had flaws. We fixed a few of those, such as a zero in place of the letter O. But there is definitely room for improvement! You’ll certainly see that in the Hindi subtitles! And sometimes the service has trouble with complex characters, which we noticed in the Kannada files.
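The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the code actually used for the project: the ffmpeg arguments are one common way to copy the first subtitle stream into an .srt file, and `translate` is a hypothetical stand-in for whatever function sends text to the translation service. The point it shows is that only the dialogue lines are translated, while cue numbers and timecodes pass through untouched.

```python
import re
from typing import Callable, List


def extract_srt_command(video_path: str, srt_path: str) -> List[str]:
    """Build an ffmpeg command that writes the first subtitle stream to an .srt file.

    Assumption: the English subtitles are the first subtitle stream (0:s:0).
    """
    return ["ffmpeg", "-i", video_path, "-map", "0:s:0", srt_path]


def translate_srt(srt_text: str, translate: Callable[[str], str]) -> str:
    """Translate the dialogue of each cue, keeping cue numbers and timecodes.

    `translate` is a hypothetical callable (text in, translated text out).
    Multi-line dialogue is joined into one line before translation, so the
    original line breaks inside a cue are not preserved.
    """
    translated_blocks = []
    # Cues in an .srt file are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            # No dialogue to translate; keep the block unchanged.
            translated_blocks.append(block)
            continue
        number, timing, dialogue = lines[0], lines[1], lines[2:]
        translated_blocks.append(
            "\n".join([number, timing, translate(" ".join(dialogue))])
        )
    return "\n\n".join(translated_blocks) + "\n"
```

The command list can be handed to `subprocess.run(..., check=True)`, and any function that maps a string to a string can be passed as `translate`, which makes the cue handling easy to try out without calling a cloud service.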
We’re hoping people can help us make these better and spread the tale of the Ramayana in all the languages of the earth. There are two paths you can use to make these subtitles better.
Method 1: Glossary Files
The Google API has support for something called a glossary file. These are words or very short phrases in a tab-delimited (tsv) or comma-delimited (csv) file. On the left is the word in the original language; on the right is the proper translation in another language. So, let’s say that you notice that the word Ramayana is not translated properly into Kannada. You would put the following entry into your glossary file:
What if the neural network is translating a word that you don’t want translated? Well, then you would put the same phrase on the right! For example:
So, if you want to make a better Kannada translation, you would create a file called glossary.Kannada.tsv (for tab separated) or glossary.Kannada.csv (for comma separated). You only need one glossary file for the entire series for your language. When we get that glossary file, we re-translate all the subtitles from English to your language. [Please see the note below about contacting us before you embark on this path!]
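As a rough sketch of what building such a file could look like, the snippet below writes two-column, tab-separated glossary rows. The two entries are illustrative choices made for this sketch, not the entries from the original page: one maps an English word to a Kannada spelling, and one repeats the same word on both sides so that it is left untranslated. The helper names are hypothetical.

```python
import csv
import io
from typing import Iterable, Tuple


def build_glossary_tsv(entries: Iterable[Tuple[str, str]]) -> str:
    """Return glossary rows as tab-separated text: source term, then target term."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for source, target in entries:
        writer.writerow([source, target])
    return buffer.getvalue()


def save_glossary(path: str, entries: Iterable[Tuple[str, str]]) -> None:
    """Write the glossary to disk as UTF-8 so non-Latin scripts survive intact."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(build_glossary_tsv(entries))


# Illustrative entries only (assumptions for this sketch).
entries = [
    ("Ramayana", "ರಾಮಾಯಣ"),          # preferred Kannada rendering
    ("Doordarshan", "Doordarshan"),  # same text on both sides: do not translate
]
glossary_text = build_glossary_tsv(entries)
```

Calling `save_glossary("glossary.Kannada.tsv", entries)` would then produce a file with the name described above.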
Method 2: Hand-Crafted Subtitles
There’s another way to solve this problem, and that is to pull up the subtitles in your language in a text editor and simply correct them by hand. This is more tedious, of course, but can yield better results. Note that these two methods are mutually exclusive: if you use the glossary approach, each file gets re-written by the software, so any hand edits you had made would get wiped out. But the hand method will yield much better results if done carefully. In the case of Hindi, where we translated from English to Hindi, there is no doubt the hand method would yield far better results!
You could use the hand method if you wanted to create subtitles in a language we don’t support yet, such as Sanskrit. If you’re using the hand method, be mindful of the .srt format and in particular of the timing of the subtitles. You can see the timecodes in the .srt file. You can also use VLC to check your work. Download the .m4v video file here from the Archive, and put your .srt file right next to it. VLC will automatically make it available, or you can manually add a file.
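Before loading a hand-edited file in VLC, a quick script can catch the most common timing slips. The sketch below is an assumption about what a useful check might look like, not a tool from this project: it expects the usual `HH:MM:SS,mmm --> HH:MM:SS,mmm` timecode on the second line of each cue, and reports cues whose timecode is malformed, whose end is not after its start, or which start before the previous cue.

```python
import re
from typing import List

# Standard .srt timing line, e.g. "00:01:02,500 --> 00:01:05,000".
TIMING = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(hours: str, minutes: str, seconds: str, millis: str) -> int:
    """Convert one timecode's parts to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def check_srt(srt_text: str) -> List[str]:
    """Return a list of human-readable problems found in the cue timings."""
    problems = []
    previous_start = -1
    blocks = re.split(r"\n\s*\n", srt_text.strip())
    for index, block in enumerate(blocks, start=1):
        lines = block.splitlines()
        match = TIMING.match(lines[1].strip()) if len(lines) >= 2 else None
        if match is None:
            problems.append(f"cue {index}: missing or malformed timecode line")
            continue
        parts = match.groups()
        start, end = to_ms(*parts[:4]), to_ms(*parts[4:])
        if end <= start:
            problems.append(f"cue {index}: end time is not after start time")
        if start < previous_start:
            problems.append(f"cue {index}: starts before the previous cue")
        previous_start = start
    return problems
```

An empty list means the timings passed these basic checks; it does not guarantee that the subtitles line up with the speech, so checking in VLC is still worthwhile.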
If you’re going down either path, let’s all try to coordinate so we don’t step on each other! Drop a line to carl at media.org and let me know which language you want to work on and which method you want to use. We’d love it if you self-organize; perhaps a group of people might work together to do Kannada or Telugu or Latin or any of the other languages. If Wikipedians wish to self-organize this and create pages for the effort, that would also be great, and we’d be happy to point to them here! As always, let’s make sure this whole thing is open source and noncommercial.
One more thing. If you’re going to work with the Internet Archive, we highly recommend the ia command-line tool. It exercises the Internet Archive API to bring great power to citizen archivists. In the case of the current project, here is how you would download all the items:
ia download --search "collection:IndiaCulture AND subject:ramayan"
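The same search can also be driven from Python with the internetarchive package that provides the ia tool. The sketch below is an illustration under assumptions: the helper names are hypothetical, and the choice to fetch only .srt files through `glob_pattern` is made here to keep downloads small, not something the original command does.

```python
from typing import List


def ramayan_query(collection: str = "IndiaCulture", subject: str = "ramayan") -> str:
    """Build the same search query used by the command above."""
    return f"collection:{collection} AND subject:{subject}"


def download_subtitles(query: str) -> List[str]:
    """Download the .srt files of every item matching the query.

    Requires the `internetarchive` package and network access. The import is
    kept inside the function so the query helper works without the package.
    """
    from internetarchive import download, search_items

    identifiers = [result["identifier"] for result in search_items(query)]
    for identifier in identifiers:
        # Assumption for this sketch: only subtitle files are wanted.
        download(identifier, glob_pattern="*.srt", verbose=True)
    return identifiers
```

Calling `download_subtitles(ramayan_query())` would place each item's subtitle files in a directory named after the item, in the current working directory.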
Carl Malamud for Public Resource
This library of books, audio, video, and other materials from and about India is curated and maintained by Public Resource. The purpose of this library is to assist the students and the lifelong learners of India in their pursuit of an education so that they may better their status and their opportunities and to secure for themselves and for others justice, social, economic and political.
This library has been posted for non-commercial purposes and facilitates fair dealing usage of academic and research materials for private use including research, for criticism and review of the work or of other works and reproduction by teachers and students in the course of instruction. Many of these materials are either unavailable or inaccessible in libraries in India, especially in some of the poorer states and this collection seeks to fill a major gap that exists in access to knowledge.
For other collections we curate and more information, please visit the Bharat Ek Khoj page. Jai Gyan!
- 2020-12-26 18:27:31
- Internet Archive Python library 1.9.5
Uploaded by Public Resource on